Abstract. An effective p-adic encoding of dendrograms is presented through 
an explicit embedding into the Bruhat-Tits tree for a p-adic number field. This 
field depends on the number of children of a vertex and is a finite extension 
of the field of p-adic numbers. It is shown that fixing p-adic representatives of 
the residue field allows a natural way of encoding strings by identifying a given 
alphabet with such representatives. A simple p-adic hierarchic classification 
algorithm is derived for p-adic numbers, and is applied to strings over finite 
alphabets. Examples of DNA coding are presented and discussed. Finally, new 
geometric and combinatorial invariants of time series of p-adic dendrograms 
are developed. 



1. Introduction 

A dendrogram is often the output of a hierarchical classification algorithm. In 
the usual agglomerative methods, it is obtained from data by a distance function 
which is adjusted after each iteration to the clusters obtained in the previous step. 
Classically, the distance is Euclidean, and the hierarchical structure is fitted to 
the data. The analyst then has to decide by other means whether the resulting 
dendrogram represents the underlying hierarchical structure of the data, or not. In 
the p-adic world, however, there is no ambiguity concerning the interpretation of 
dendrograms. The reason is that the p-adic distance is ultrametric. This has the 
effect that a p-adic dendrogram correctly represents the hierarchies within a given 
set of p-adic numbers, of course with respect to the p-adic metric. Another effect 
is, as we will show, that p-adic classification is algorithmically much simpler than 
its classical counterpart. The consequence for data mining lies in the shift from 
classification to data encoding. 

If the dendrogram X is known, then its p-adic encoding can be effected by 
associating paths from the top cluster down to the data with p-adic numbers. This 
is in fact an embedding of X into the p-adic Bruhat-Tits tree which can be seen 
as a "universal dendrogram". This embedding will be made precise in this article. 
Strings over an alphabet are the only instance known to the author, in which p-adic 
data encoding can be realised in a straightforward manner. The encoding depends 
on the coefficients in p-adic expansions associated to the alphabet. Examples of 
p-adic DNA encoding are proposed and discussed. 

Time series of p-adic dendrograms give rise to new geometric invariants, namely, 
if translations along geodesic lines in the Bruhat-Tits tree can be identified, a 
discrete group action can be estimated in important cases. This action then leads 
to a dynamic system on a so-called Mumford curve, the p-adic analogue of a Riemann 
surface. Studying this dynamic system will yield parameters which can be used e.g. 
for extrapolating hierarchical data in time. 

Possible applications of p-adic dendrograms are coding theory of graphs and 
strings. Another area of application can be spatial reasoning and querying, including 
space-time issues. The time series point of view is naturally applicable to 
strings. The idea of studying p-adic dendrograms is taken from [8]. Linear fractions 
are considered in |P in the p-adic and real case simultaneously. A description for a 
general audience of the p-adic Bruhat-Tits tree and some of its discrete symmetries 
can be found in [3]. 

Date: February 1, 2008. 

PATRICK ERIK BRADLEY 

2. Embedding a dendrogram into the p-adic Bruhat-Tits tree 

In order to embed a dendrogram X into the p-adic Bruhat-Tits tree, we first 
define X as the dendrogram for its data plus an extra point $\infty$. The reason is 
that in this way the top cluster becomes the vertex uniquely determined by $\infty$ and 
two data points at maximal distance. This viewpoint leads to the term projective 
dendrogram, and we will see that in the p-adic case, it is associated to the p-adic 
projective line minus the p-adic numbers representing the data and $\infty$. 

2.1. Abstract dendrograms. Dendrograms represent hierarchies within data, 
and are therefore trees, i.e. graphs without loops. Subsets of data points are 
clusters represented by the vertices, and inclusions of clusters are represented by paths 
between the corresponding vertices. It is useful to distinguish between clusters and 
data in the same way as one distinguishes sets from their elements: even if certain 
clusters are singletons, they are nevertheless not data, in the same way as the set 
$\{x\}$ is in a strict sense not the same thing as the point $x$. Hence, in our viewpoint, 
the data will not be part of, but at the boundary of a dendrogram. For this reason, we 
will allow graphs to have unbounded edges. 

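Definition 2.1. A graph is a quadruple $\Gamma = (\Gamma^0, \Gamma', \partial, \iota)$, where $\Gamma^0$ and $\Gamma'$ are sets, 
$\partial\colon \Gamma' \to \Gamma^0$ is a map, and $\iota\colon \Gamma' \to \Gamma'$ is an involutive map (i.e. $\iota \circ \iota = \mathrm{id}$). The 
elements of $\Gamma^0$ are called vertices, and those of $\Gamma'$ flags; $\partial$ is called the boundary 
map, and $\iota$ the inversion. A graph $\Gamma$ is finite if $\Gamma^0$ and $\Gamma'$ are both finite. 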

The inversion yields an equivalence relation on the set of flags: $F_1 \sim F_2$ iff 
$F_1 = F_2$ or $F_1 = \iota(F_2)$. The equivalence classes under $\sim$ are called the edges of 
$\Gamma$. The set of edges is denoted by $\Gamma^1 = \Gamma'/\sim$. An edge is called unbounded if it 
consists of a single flag, otherwise it is called internal. We denote the set of internal 
resp. unbounded edges by $\Gamma^1_0$ resp. $\Gamma^1_\infty$. 

A graph $\Gamma$ has a topological model $|\Gamma|$, obtained by identifying each flag $F$ with 
the half-open interval $[0, 1)_F$ and pasting $\partial F$ to $F$, and then taking the quotient 
by $\sim$. This model reflects important topological properties of the graph, such as 
the number of connected components or the number of "holes", i.e. minimal loops 
in $\Gamma$. These quantities are known as the Betti numbers $h_0(|\Gamma|, \mathbb{R})$ and $h_1(|\Gamma|, \mathbb{R})$ 
from algebraic topology, where they are introduced as the dimensions of certain real 
vector spaces. For finite graphs, there is an important formula which relates the 
Betti numbers to the combinatorial data: 

\[ h_0(|\Gamma|, \mathbb{R}) - h_1(|\Gamma|, \mathbb{R}) = \#\Gamma^0 - \#\Gamma^1_0, \]

known as the Euler formula. 

A graph $\Gamma$ is a tree if it is connected and without loops, or, equivalently, if the 
Betti numbers of $|\Gamma|$ satisfy $h_0(|\Gamma|, \mathbb{R}) = 1$ and $h_1(|\Gamma|, \mathbb{R}) = 0$. 
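The Euler formula gives a quick computational test for the tree property. The following is a minimal sketch, assuming a finite graph is stored as a set of vertices together with two dictionaries for the boundary map and the inversion; the names `edges_of` and `betti_numbers` and the sample graph are illustrative and not part of the original text.

```python
def edges_of(flags, inversion):
    """Split the flags into internal edges (two flags) and unbounded edges (one flag)."""
    internal, unbounded = set(), set()
    for flag in flags:
        edge = frozenset({flag, inversion[flag]})
        (internal if len(edge) == 2 else unbounded).add(edge)
    return internal, unbounded


def betti_numbers(vertices, boundary, inversion):
    """Return (h0, h1) of a finite graph via h0 - h1 = #vertices - #internal edges."""
    internal, _ = edges_of(boundary.keys(), inversion)
    # Union-find over the vertices counts the connected components (h0).
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for edge in internal:
        a, b = (boundary[flag] for flag in edge)
        parent[find(a)] = find(b)
    h0 = len({find(v) for v in vertices})
    h1 = h0 - (len(vertices) - len(internal))
    return h0, h1


# Hypothetical example: root v carries the unbounded edges "inf" and "x0",
# and one internal edge {vw, wv} leads to w, which carries "x1" and "x2".
vertices = {"v", "w"}
boundary = {"vw": "v", "wv": "w", "inf": "v", "x0": "v", "x1": "w", "x2": "w"}
inversion = {"vw": "wv", "wv": "vw", "inf": "inf", "x0": "x0", "x1": "x1", "x2": "x2"}
h0, h1 = betti_numbers(vertices, boundary, inversion)
is_tree = (h0, h1) == (1, 0)
```

An unbounded edge is modelled here as a flag fixed by the inversion, which matches the definition of an edge consisting of a single flag.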

A rooted tree is a pair $(T, v)$, where $T$ is a tree and $v \in T^0$ is a vertex. The 
distinguished vertex $v$ is called the root and makes a rooted tree $(T, v)$ into a directed 
tree by orienting all edges away from $v$, i.e. an internal edge $e$ with boundary 
$\{w_1, w_2\}$ is oriented from $w_1$ to $w_2$, if $w_1$ is closer to $v$ than $w_2$ is, and an unbounded 
edge $e'$ is oriented away from the unique vertex $\partial e'$. 

By an abstract dendrogram we mean a finite rooted tree $\mathcal{T} = (T, v)$, all of whose 
vertices originate in at least two edges (unbounded or not). It is labelled, if its 
unbounded edges are labelled by some bijective map $\lambda\colon T^1_\infty \to L$, where $L$ is a 
set whose elements are called labels. A projective dendrogram is a labelled abstract 
dendrogram $\mathcal{T}$ whose root originates in an unbounded edge labelled $\infty$ and at least 
two more edges. The unbounded edges not labelled $\infty$ are called the datapoints or 
data underlying $\mathcal{T}$, and will be denoted by $D$. Figure 1 illustrates a projective 
dendrogram with three data points. The reason for calling it "projective" will 
become apparent in later subsections. 

Figure 1. Projective dendrogram with three datapoints. 

Remark 2.2. To call unbounded edges of a dendrogram $\mathcal{T}$ "data" seems to be in 
contradiction to the initial purpose of having data not be part of a dendrogram. 
However, by looking at the topological model, we see first that an unbounded edge is 
nothing but a half-line in the tree $T$ underlying $\mathcal{T}$. The ends of $T$ are equivalence 
classes of half-lines, where half-lines differing only in finitely many internal edges 
are equivalent. According to graph theory, the ends form the boundary $\partial T$ of $T$. 
Hence, we have in fact identified $\partial T = T^1_\infty$ with data and $\infty$. 

Given some projective dendrogram $\mathcal{T} = (T, v, \lambda, D)$, there is an order relation $<$ 
on $T^1 \setminus \lambda^{-1}(\infty)$: $e < e'$, if $e$ lies on the path from $\infty$ to $e'$. If $e$ and $e'$ originate in 
some vertex $v$, impose any total order $<_v$ on the edges originating in $v$ in order to 
break ties. This extends to a total order on data as follows: the lexicographic order 
with respect to $<$ and all $<_v$ on the reduced words $\infty \cdots e$ associated to minimal 
paths $w_e$ (i.e. paths without backtracking) from $\infty$ to $e$ induces a total order on $D$ 
via the unique bijection between $D$ and $\{w_e \mid e \in T^1_\infty\}$ induced by $\lambda$. A projective 
dendrogram together with a total ordering $<$ on its data is called ordered. We call 
a minimal path in a tree geodesic. 

A metric on a projective dendrogram is a function $\mu\colon T^1_0 \to \mathbb{N} \setminus \{0\}$, and defines 
in an obvious manner a distance $d\colon T^0 \times T^0 \to \mathbb{N}$. This induces a level structure 
$\ell\colon T^0 \to \mathbb{N}$, $w \mapsto \ell(w) = d(v, w)$, where $v$ is the root. Figure 2 displays a projective 
dendrogram with level structure and $\infty$ on top. 

2.2. The binary case. Let $X = (T, v, D, \mu)$ be an ordered projective dendrogram 
with metric. In this subsection, we assume that $T$ is a binary tree, i.e. each 
vertex $w \in T^0$ has precisely two (internal or unbounded) directed outgoing edges 
$e_0(w) < e_1(w)$. With $e_0 := e_0(v)$ and $e_1 := e_1(v)$ we have that $T \setminus \{v\}$ is the 
disjoint union of the two branches $\Gamma_0 \ni e_0$ and $\Gamma_1 \ni e_1$. $\Gamma_0$ and $\Gamma_1$ are themselves 
projective dendrograms, if the $e_i$ are labelled $\infty_i$. We define functions 



Together, $\chi_0$ and $\chi_1$ define a function $\chi\colon T^1 \to \{0, 1\}$ such that $\chi(e_0) = 0$ and 
$\chi(e_1) = 1$. This extends to a function on the set $\mathcal{G}(\infty, D)$ of directed geodesics 
from $\infty$ to any datum $x \in D$ 

where $o(e)$ denotes the origin vertex of the edge $e$, and $\ell$ is the level function on 
$X$. Together with the identification $\mathcal{G}(\infty, D) = D$, we obtain the 2-adic encoding 
$\chi\colon D \to \mathbb{Q}_2$ of binary data. 

Figure 2. A projective dendrogram with level structure. 
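The binary encoding can be sketched in a few lines of Python. The sketch assumes that the value of a geodesic is the sum of the terms $\chi(e) \cdot 2^{\ell(o(e))}$ over its edges $e$, which is how the description of $o(e)$ and $\ell$ above is read here; the nested-tuple representation of a dendrogram and the name `encode_binary` are illustrative choices, not taken from the text.

```python
def encode_binary(vertex, level=0, acc=0):
    """Map each datum of a binary dendrogram to a natural number (a finite 2-adic expansion).

    A vertex is a pair (child_0, child_1). A child is either a datum label (str),
    i.e. an unbounded edge, or a pair (mu, vertex): an internal edge of length mu >= 1
    leading to the next vertex.
    """
    codes = {}
    for chi, child in enumerate(vertex):   # chi(e_0) = 0, chi(e_1) = 1
        value = acc + chi * 2 ** level     # add chi(e) * 2^(level of the origin of e)
        if isinstance(child, str):
            codes[child] = value           # unbounded edge: a datum
        else:
            mu, sub_vertex = child         # internal edge of length mu
            codes.update(encode_binary(sub_vertex, level + mu, value))
    return codes


# Hypothetical dendrogram: the root has an internal edge of length 1 to a vertex
# with data "a" and "b", and a datum "c" attached directly to the root.
codes = encode_binary(((1, ("a", "b")), "c"))
```

In this example the data receive the codes 0, 2 and 1, so the values 0 and 1 both occur, and distinct data always differ in the binary digit at the level of the vertex where their geodesics separate.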

Remark 2.3. The coding function $\chi$ is in fact $\mathbb{Z}_2$-valued. Even more, its values 
are natural numbers, because $X$ is finite. By construction, the values 0 and 1 are 
taken by $\chi$ for any projective dendrogram. In fact, $\chi(x_0) = 0$ and $\chi(x_n) = 1$, if 
$D = \{x_0 < \dots < x_n\}$. 

Example 2.4. The dendrogram in Figure 2 has the 2-adic encoding $\chi$ given by: 

xi = 0, 2^, X3 ^ 2^ X4 22 

= 2^ + 24, xe = 2^ + 2^, xt^2" + 2\ xs ^ 1. 

This encoding differs slightly from the 2-adic encoding of the same dendrogram in 

2.3. The Bruhat-Tits tree for p-adic fields. It was observed that an encoding 
of dendrograms with p-adic numbers from $\mathbb{Q}_p$ leads to considering subtrees of the 
Bruhat-Tits tree $\mathscr{T}_{\mathbb{Q}_p}$. Here, we intend to prepare an effective embedding of 
dendrograms into the Bruhat-Tits tree, which is going to be made precise in Section 
2.4. The preparation consists in reviewing the construction of $\mathscr{T}_{\mathbb{Q}_p}$ and the variants 
$\mathscr{T}_K$ for finite extension fields $K$ of $\mathbb{Q}_p$, as $\mathbb{Q}_p$ itself turns out to be too small in 
general for encoding data. 

2.3.1. The Bruhat-Tits tree for $\mathbb{Q}_p$. The p-adic field $\mathbb{Q}_p$ can be defined as the field 
of Laurent series 

\[ \sum_{\nu=-m}^{\infty} a_\nu p^\nu, \qquad a_\nu \in \{0, \dots, p-1\}. \]
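For natural numbers the Laurent series is finite and its coefficients are just the base-$p$ digits. The following Python sketch illustrates this together with the p-adic norm of a rational number; the function names are illustrative, and the code assumes that `p` is prime without checking it.

```python
from fractions import Fraction


def padic_digits(n, p):
    """Coefficients a_0, a_1, ... of the natural number n = sum a_i * p**i."""
    digits = []
    while n > 0:
        n, a = divmod(n, p)
        digits.append(a)
    return digits


def padic_valuation(x, p):
    """Exponent of p in the factorisation of the nonzero rational number x."""
    x = Fraction(x)
    if x == 0:
        raise ValueError("the valuation of 0 is infinite")
    v, num, den = 0, x.numerator, x.denominator
    while num % p == 0:
        num //= p
        v += 1
    while den % p == 0:
        den //= p
        v -= 1
    return v


def padic_norm(x, p):
    """|x|_p = p**(-valuation), with |0|_p = 0."""
    x = Fraction(x)
    return Fraction(0) if x == 0 else Fraction(1, p) ** padic_valuation(x, p)
```

For instance, `padic_digits(12, 2)` lists the digits of $12 = 2^2 + 2^3$, and `padic_norm(12, 2)` is $1/4$ because the lowest nonzero digit sits at position 2.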

It is well known that the p-adic norm induces a topology on the field $\mathbb{Q}_p$ which makes 
it into a totally disconnected space. This is, however, compensated by the fact that 
p-adic discs never overlap partially: two discs are either disjoint, or one contains the 
other. Hence, the ultrametric inequality provides us with a tree-like topology on the 
set of discs. It is precisely this hierarchical structure of discs which makes p-adic 
numbers interesting for hierarchical classification. Consider the unit disc 

\[ \mathbb{D} = \{x \in \mathbb{Q}_p : |x|_p \le 1\} = B_1(0). \]

It is a subring of $\mathbb{Q}_p$ which coincides with the ring $\mathbb{Z}_p$ of p-adic integers. 
It has a unique maximal ideal $p\mathbb{Z}_p$, and this ideal coincides with the maximal "open" 
(non-trivial) subdisc $\{x \in \mathbb{Q}_p \mid |x|_p < 1\}$. It is a standard fact from algebra that 
the quotient of a unital commutative ring by a maximal ideal is a field. In our 
case, $\mathbb{Z}_p / p\mathbb{Z}_p \cong \mathbb{F}_p$, the finite field with $p$ elements. This is well known and follows 
from the fact that the unit disc is covered by the finite number of translates of the 
subdisc $p\mathbb{Z}_p$: 

\[ \mathbb{Z}_p = \bigcup_{x=0}^{p-1} (x + p\mathbb{Z}_p), \]

which says in a fancy way that there are precisely $p$ choices for the constant term 
in the power series expansion of any p-adic integer. Hence, we have a hierarchical 
structure of a disc with $p$ maximally smaller subdiscs. By rescaling and translation, 
it follows immediately that any p-adic disc has precisely $p$ smaller subdiscs which 
are maximal as subdiscs. Observe that $p\mathbb{Z}_p$ has precisely one minimal bigger disc 
containing $p\mathbb{Z}_p$, namely $\mathbb{D}$. Again, this holds for all p-adic discs. The consequence 
is that the set of all subdiscs of $\mathbb{Q}_p$ forms a $(p+1)$-regular tree $\mathscr{T}_{\mathbb{Q}_p}$, called the 
Bruhat-Tits tree for $\mathbb{Q}_p$. Figure 3 shows an illustration of $\mathscr{T}_{\mathbb{Q}_2}$ taken from [3, Fig. 5]. 

Figure 3. The Bruhat-Tits tree for $\mathbb{Q}_2$. 

2.3.2. Bruhat-Tits trees for p-adic number fields. The field $\mathbb{R}$ of real numbers is 
complete with respect to the archimedean distance $|\cdot|_\infty$. In the same way, the 
field $\mathbb{Q}_p$ is complete with respect to $|\cdot|_p$. However, neither $\mathbb{R}$ nor $\mathbb{Q}_p$ is algebraically 
closed. In the archimedean case, the algebraic closure of $\mathbb{R}$ is the field $\mathbb{C}$ of complex 
numbers, and $\mathbb{C}$ is a two-dimensional vector space over the scalar field $\mathbb{R}$. By 
definition, the degree of a field extension $L$ over $K$ (meaning $K$ is a subfield of a 
field $L$) is the dimension of $L$ as a vector space over the scalar field $K$. If that 
degree is finite, then $L$ is called a finite extension of $K$. Hence, the degree of $\mathbb{C}$ over 
$\mathbb{R}$ is 2, and $\mathbb{R}$ has no other finite field extensions. In contrast, $\mathbb{Q}_p$ has extension 
fields of arbitrary degree. Hence, the algebraic closure of $\mathbb{Q}_p$ is an infinite extension 
of $\mathbb{Q}_p$. Assume that a finite extension field $K$ of $\mathbb{Q}_p$ of degree $n$ is given. Then 
it is known that the distance $|\cdot|_p$ extends uniquely to a norm $|\cdot|_K$ on $K$, and $K$ is 
complete with respect to $|\cdot|_K$ [6, §5.3]. Again the unit disc 

\[ \mathcal{O}_K = \{x \in K : |x|_K \le 1\} \]

is a ring with unique maximal ideal 

\[ \mathfrak{m}_K = \{x \in K : |x|_K < 1\}, \]

and $\kappa = \mathcal{O}_K / \mathfrak{m}_K$ is a finite field extension of $\mathbb{F}_p$, called the residue field. It is finite 
with $p^f$ elements, if $f$ is the degree of $\kappa$ over $\mathbb{F}_p$. In general, the degree $n$ is not 
smaller than $f$, but if $n = f$, then $K$ is called unramified over the subfield $\mathbb{Q}_p$. A 
finite extension field of $\mathbb{Q}_p$ is also called a p-adic number field, and the elements of 
$\mathbb{Q}_p$ are sometimes called rational p-adic numbers. 

In any case, if $K$ is a p-adic number field with $\kappa = \mathbb{F}_{p^f}$, then in the same manner 
as with $\mathbb{Q}_p$ the unit disc $\mathcal{O}_K$ is covered by $p^f$ translates of the subdisc $\mathfrak{m}_K$. This 
gives rise to the Bruhat-Tits tree $\mathscr{T}_K$ for $K$ which is an infinite $(p^f + 1)$-regular tree. 
In other words, the number of edges emanating from a vertex of $\mathscr{T}_K$ depends on 
the residue field $\kappa$ which can be the same for different extension fields of $\mathbb{Q}_p$. So, 
the choice of unramified extensions is in some sense optimal for constructing the 
Bruhat-Tits trees. 

In general, $p$ will not be a prime element of $\mathcal{O}_K$, but this is true in the unramified 
case. In fact, for $K$ unramified of degree $f$ over $\mathbb{Q}_p$, it holds true that $\mathfrak{m}_K = p\mathcal{O}_K$, 
and every element of $\mathcal{O}_K$ has an expansion 

\[ \sum_{\nu=0}^{\infty} a_\nu p^\nu, \qquad a_\nu \in \mathfrak{R}, \]

where $\mathfrak{R}$ is a system of $p^f$ representatives modulo $p\mathcal{O}_K$. However, in the ramified 
case, $p$ is not a prime in $\mathcal{O}_K$. But also in this case, the maximal ideal $\mathfrak{m}_K$ is of 
the form $\pi\mathcal{O}_K$ for some prime $\pi \in \mathcal{O}_K$. It is always possible to choose $\pi$ such that 
$|\pi|_K = p^{-1/e}$ for some natural number $e \ge 1$ [6, §5.4]. This number $e$ is called the 
ramification index of $K$ over $\mathbb{Q}_p$, and $K$ is ramified over $\mathbb{Q}_p$, if $e > 1$. The extension 
$K$ over $\mathbb{Q}_p$ is called purely ramified, if $f = 1$. 

2.3.3. Cyclotomic p-adic fields. It is known that $\mathbb{Q}_p$ contains the $(p-1)$-st roots of unity, 
but not the $(p^f - 1)$-st roots of 1 for $f > 1$. Therefore, we discuss the fields $\mathbb{Q}_p(\zeta)$ 
obtained by adjoining to $\mathbb{Q}_p$ the $n$-th roots of unity which are all powers of $\zeta$, a 
primitive $n$-th root of 1. We will first consider the case that $p$ is prime to $n$. In that 
case, $\mathbb{Q}_p(\zeta)$ is unramified over $\mathbb{Q}_p$, the degree is the smallest number $f$ such that 
$p^f \equiv 1 \bmod n$, and $\{1, \zeta, \dots, \zeta^{f-1}\}$ represents an $\mathbb{F}_p$-basis of $\kappa = \mathbb{F}_{p^f}$ in $\mathcal{O}_{\mathbb{Q}_p(\zeta)}$ 
which equals the polynomial ring $\mathbb{Z}_p[\zeta]$ [11, II.(7.12)]. That means, we can choose 

\[ \mathfrak{R} = \Big\{ \sum_{i=0}^{f-1} a_i \zeta^i \;\Big|\; a_i \in \{0, \dots, p-1\} \Big\} \]

as a system of representatives which is in bijection with a subset of $\mathbb{N}^f$. 

The most economic choice for $n$ is certainly $p^f - 1$. In that case, $\mathbb{Q}_p(\zeta)$ is again 
unramified of degree $f$ over $\mathbb{Q}_p$, and 

\[ \mathfrak{R}_T = \{0, 1, \zeta, \dots, \zeta^{p^f - 2}\} \]

is an alternative set of representatives for $\kappa = \mathbb{F}_{p^f}$. This is a consequence of 
Hensel's Lemma [Bj Thm. 5.4.8] (cf. [SI §5.4]). The elements of $\mathfrak{R}_T$ are called the 
Teichmüller representatives of $\mathbb{F}_{p^f}$ and have the characterising property that the 
residue class of $\zeta$ generates the multiplicative group $\mathbb{F}_{p^f}^\times = \mathbb{F}_{p^f} \setminus \{0\}$. For example, 
if $f = 1$, then already $\mathbb{Q}_p$ contains the $(p-1)$-st roots of 1 which form the Teichmüller 
representatives in that case [H Cor. 4.3.8]. 

Another important case is when $\zeta$ is a primitive $p^m$-th root of unity. Then $\mathbb{Q}_p(\zeta)$ 
is purely ramified over $\mathbb{Q}_p$ of degree $e = (p-1)p^{m-1}$ [11, II.(7.13)]. 
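The degree of the unramified cyclotomic field obtained from $n$-th roots of unity with $n$ prime to $p$ is the multiplicative order of $p$ modulo $n$, which is easy to compute. The Python sketch below does this by brute force; the name `unramified_degree` is illustrative, and primality of `p` is assumed rather than checked.

```python
from math import gcd


def unramified_degree(p, n):
    """Smallest f >= 1 with p**f congruent to 1 modulo n, for n prime to p."""
    if gcd(p, n) != 1:
        raise ValueError("n must be prime to p")
    f, power = 1, p % n
    while power != 1 % n:          # 1 % n also covers the trivial case n = 1
        power = (power * p) % n
        f += 1
    return f
```

As a usage example, `unramified_degree(2, 3)` returns 2, so adjoining a primitive third root of unity to $\mathbb{Q}_2$ gives a quadratic unramified extension, and for $n = p^f - 1$ the function returns $f$.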

2.4. Algebraic p-adic dendrograms. The reason for introducing the Bruhat-Tits 
tree $\mathscr{T}_K$ also for finite extensions $K$ of $\mathbb{Q}_p$ is that the number of children 
of a vertex can in principle be unbounded. This means that $K$ must be taken 
sufficiently large in order for a dendrogram to be embeddable into $\mathscr{T}_K$. In this 
subsection, we will effect the embedding, define p-adic dendrograms and discuss 
these from a geometric perspective. 

2.4.1. Cyclotomic encoding. Let $X = (T, v, D, \mu)$ be a projective dendrogram. By 
the children $\mathrm{ch}(w)$ of a vertex $w \in T^0$ we mean the outgoing edges of $w$ in $T$ which 
are not labelled $\infty$. Let 

\[ m = \max\{\#\mathrm{ch}(w) \mid w \in T^0\}, \]

and $f$ minimal such that $m \le p^f$. $K = \mathbb{Q}_p(\zeta)$ will denote the cyclotomic field with $\zeta$ 
a primitive $(p^f - 1)$-st root of unity, and assume that $\mathfrak{R}$ is a full system of representatives 
modulo $p\mathcal{O}_K$ containing 0 and 1. Generalising the binary case, $\chi_w\colon \mathrm{ch}(w) \to \mathfrak{R}$ is 
now an inclusion map for every vertex $w \in T^0$ such that $0 \in \mathrm{im}\,\chi_w$, and $1 \in \mathrm{im}\,\chi_v$. 
These maps form together a map $\chi\colon T^1 \to \mathfrak{R}$ which yields a p-adic encoding map 

where $\ell\colon T^0 \to \mathbb{N}$ is the level map derived from the metric $\mu$ as in Section 2.1. 

Again, as in the binary case, the natural identification $D \cong \mathcal{G}(\infty, D)$ yields a 
p-adic encoding $\chi\colon D \to K$ of the data. In the case $m = 2$, we recover the binary 
encoding as in Section 2.2 for ordered dendrograms, if the local encoding maps 
$\chi_w\colon \mathrm{ch}(w) \to \{0, 1\}$ are chosen appropriately. 

Remark 2.5. If a dendrogram is binary, or the prime $p$ is sufficiently large (not 
smaller than the largest number of children of any given vertex), then a rational 
p-adic encoding is possible. In this case, data will be represented by finite p-adic 
expansions, hence by natural numbers. Restricting to rational p-adic encoding has 
the disadvantage that $p$ is a fixed bound for the number of possible children vertices 
in dendrograms. Hence, if there is no a priori bound in data, then it is necessary to 
allow unramified extensions of $\mathbb{Q}_p$ of arbitrary degree. From a computational point 
of view it is probably most interesting to keep the prime $p$ as low as possible, i.e. 
$p = 2$. 
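The trade-off between the prime $p$ and the degree of the extension can be made explicit. The short Python sketch below computes the smallest $f$ for which a residue field with $p^f$ elements offers enough labels for $m$ children, reading the condition on $f$ as $m \le p^f$ in line with Remark 2.5; the function name is illustrative.

```python
def minimal_residue_degree(p, m):
    """Smallest f >= 1 such that m <= p**f, for a prime p and m children per vertex."""
    f, size = 1, p
    while size < m:
        size *= p
        f += 1
    return f
```

For $p = 2$ this gives $f = 1$ for binary dendrograms, $f = 2$ for up to four children and $f = 3$ for five to eight children, whereas a prime $p \ge m$ always allows $f = 1$, i.e. a rational p-adic encoding.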

2.4.2. Dendrograms and the p-adic projective line. Let $X$ be a projective 
dendrogram. In Section 2.4.1 we have constructed an embedding of the underlying data 
$D \to K$ into a p-adic number field $K$. From a geometric viewpoint, this is an 
embedding of $D$ into the p-adic affine line. It is often convenient to treat points 
on the affine line and $\infty$ on an equal footing. The geometric space enabling this is 
the projective line. Hence, we consider a p-adic number from $K$ as a point in the 
p-adic projective line $\mathbb{P}^1$. The space $\mathbb{P}^1$ is a p-adic manifold defined over $\mathbb{Q}_p$ and 
whose $K$-rational points are given by $\mathbb{P}^1(K) = K \cup \{\infty\}$ for any field $K$ containing 
$\mathbb{Q}_p$. Here, we consider only the case that $K$ is a p-adic number field. 

In Section 2.3 we have constructed for each $K$ the Bruhat-Tits tree $\mathscr{T}_K$ by 
associating to each disc in $K$ a vertex of $\mathscr{T}_K$. To an inclusion $B \subset B'$ of discs 
corresponds a geodesic path between the associated vertices $w$ and $w'$. It is a fact 
that any strictly descending infinite chain of discs in $K$ 

\[ B_1 \supset B_2 \supset \dots \tag{1} \]

converges to a $K$-rational point: $\{x\} = \bigcap B_\nu$ with $x \in \mathbb{P}^1(K)$. In the tree $\mathscr{T}_K$, the 
chain (1) corresponds to an infinite geodesic half-line 

\[ \bullet - \bullet - \bullet - \cdots \]

and the p-adic number $x$ lies at its end. It is a well known fact that the ends of 
$\mathscr{T}_K$ correspond bijectively to $\mathbb{P}^1(K)$. Now let $S \subset \mathbb{P}^1(K)$ be a finite set of p-adic 
numbers. Then we can form the *-tree which is the smallest subtree $T^*(S)$ of $\mathscr{T}_K$ 
whose ends correspond to $S$. If $S \setminus \{\infty\} \subset \mathcal{O}_K$ and $S$ contains 0 and 1, then $T^*(S)$ 
can be made in a natural way into a projective dendrogram. First observe that 
three distinct points $x, y, z \in \mathbb{P}^1(K)$ define a unique vertex $v(x, y, z)$ in $\mathscr{T}_K$: it is 
the intersection of the three geodesics ending in $\{x, y, z\}$. Hence, $T^*(S)$ contains 
the vertex $v = v(0, 1, \infty)$ corresponding to the unit disc $\mathcal{O}_K$. Choose $v$ as the 
root. As $S \subset \mathcal{O}_K \cup \{\infty\}$, all vertices on the half-line $]v, \infty[$ have precisely two 
emanating edges in $T^*(S)$. By defining $T^0_S$ to be the set of vertices $w$ in $T^*(S)$ 
with $\#\mathrm{ch}(w) \ge 2$, and $T^1_S$ the set of geodesic paths $]w, w'[$ between $w \in T^0_S$ and 
$w' \in T^0_S \cup S$ not containing a vertex from $T^0_S$, we obtain a tree $T_S$ whose data $D_S$ are 
the half-lines $]w, s[$ with $w \in T^0_S$, $s \in S \setminus \{\infty\}$ and not containing any vertex from 
$T^0_S$. This yields a projective dendrogram $X_S = (T_S, D_S, \mu_S)$, where the metric $\mu_S$ 
is defined as the number $\mu_S(e)$ of edges in $\mathscr{T}_K$ on the geodesic path corresponding 
to $e \in T^1_S$. It is clear that there is a natural p-adic encoding $\chi_S\colon D_S \to S$ with 
numbers from $K$. 

A tree in which every vertex $v$ has more than two emanating edges is called 
stable. This is, in a way, a kind of minimal representation of a tree. In that sense, 
$T_S$ is the stabilisation of $T^*(S)$. It is clear that for any projective dendrogram 
$X = (T, v, D, \mu)$ a p-adic encoding $\chi\colon D \to K$ as in Section 2.4.1 yields a tree 
$T^*(\chi(D))$ whose stabilisation is tree-isomorphic to $T$. Hence, a p-adic encoding 
$\chi$ means in p-adic geometry an embedding of projective dendrograms into $\mathscr{T}_K$ for 
some p-adic field $K$ through the assignment 

\[ \mathcal{E}\colon X \mapsto \mathbb{P}^1 \setminus (\chi(D) \cup \{\infty\}). \]

This assignment is a map from the space $\mathfrak{D}_n$ of dendrograms on $n$ data to the space 
$\mathfrak{M}_{0,n+1}$ of $(n+1)$-pointed projective lines. Any pointed projective line is, by means 
of a projective linear transformation, represented by $\mathbb{P}^1 \setminus \{x_0, \dots, x_n\}$ such that 
$x_0 = 0$, $x_1 = 1$, $x_2 = \infty$. 

Definition 2.6. A p-adic dendrogram is a pair $(X, \chi)$ with a projective dendrogram 
$X$ and a map $\chi\colon D \to K$ into some p-adic number field $K$ such that there is an 
isometric isomorphism between $T^*(\mathrm{im}(\chi) \cup \{\infty\})$ and the underlying tree of $X$. A 
p-adic dendrogram is called normal if $0, 1 \in \mathrm{im}(\chi) \subset \mathcal{O}_K$. 

2.4.3. Binary data are generic. The space $\mathfrak{D}_n$ is known to be a polyhedral complex 
of dimension $n - 2$, and the cells of maximal dimension consist of the dendrograms 
whose underlying trees are binary. In fact, the dimension equals the number of 
internal edges (which can be of arbitrary length), and for binary dendrograms this 
number is $n - 2$. The other dendrograms are all contained in lower dimensional 
cells. As the latter are obtained by contracting edges of binary dendrograms, the 
corresponding cells in $\mathfrak{D}_n$ are always in the boundary of cells of maximal dimension 
$n - 2$. Hence, binary dendrograms are generic. 

The *-tree construction from Section 2.4.2 gives a map 

\[ \theta\colon \mathfrak{M}_{0,n+1} \to \mathfrak{D}_n, \qquad \mathbb{P}^1 \setminus S \mapsto T^*(S), \]

where $S \subset \mathbb{P}^1(K)$ is assumed to contain $0, 1, \infty$. This map is the Tate map and is 
many-to-one. In fact, its fibres are open in the analytic topology of $\mathfrak{M}_{0,n+1}$. The 
relation to the above is that the p-adic encoding map $\mathcal{E}\colon \mathfrak{D}_n \to \mathfrak{M}_{0,n+1}$ is a section 
of the Tate map, i.e. $\theta \circ \mathcal{E} = \mathrm{id}_{\mathfrak{D}_n}$. 

From this geometric point of view, a dendrogram is merely a point in $\mathfrak{D}_n$, 
and a p-adic dendrogram is determined by a point $x \in \mathfrak{M}_{0,n+1}$. A time series 
of dendrograms is given as a map $\{x_0, \dots, x_N\} \to \mathfrak{D}_n$ or, in the p-adic case, 
$\{x_0, \dots, x_N\} \to \mathfrak{M}_{0,n+1}$. 

In what follows, we consider w.l.o.g. p-adic dendrograms. In general, a family of 
dendrograms with $n$ data is given by a map $S \to \mathfrak{M}_{0,n+1}$ for some p-adic space $S$. 
By p-adic geometry, there is an associated continuous map $\Sigma \to \mathfrak{D}_n$ for some real 
space $\Sigma$ depending on $S$. The space $S$ is viewed as a parameter space for the family: 
small variations in $S$ yield nearby dendrograms. Hence, we can speak of a small 
deformation of dendrograms: this is a family of dendrograms such that $\Sigma \to \mathfrak{D}_n$ 
maps into a fixed cell. In that case, the topological type of each dendrogram 
parametrised by $S$ (or $\Sigma$) is always the same, only the lengths of internal edges 
vary in the family. More details on families of dendrograms can be found in [TJ [5] . 

3. Classification of strings 

By using a finite unramified extension $K$ of $\mathbb{Q}_p$, one obtains an encoding of data 
by finite expansions in powers of $p$ and coefficients in a system $\mathfrak{R}$ of representatives 
modulo $p\mathcal{O}_K$. We will always assume that $\mathfrak{R}$ contains 0 and 1, so that we can then 
speak of polynomials in $p$ over $\mathfrak{R}$. We denote by $\mathfrak{R}[p]$ the set of all such polynomials. 

3.1. Cyclotomic p-adic encoding of strings. Let $A$ be some finite alphabet, 
and $S(A)$ the set of all possible strings using letters from $A$. In other words, $S(A)$ 
is the set of infinite sequences of letters from $A$. We will interpret finite strings 
also as infinite sequences by assuming that $A$ contains a distinguished letter or 
"blank", and a string is finite, if only finitely many of its letters are not blank. The 
set of finite strings is denoted by $S_{\mathrm{fin}}(A)$. $S(A)$ is endowed with the ultrametric 
Baire distance 

\[ \delta_p\colon S(A) \times S(A) \to \mathbb{R}, \qquad (x, y) \mapsto \inf\{p^{-n} \mid x[n] = y[n]\}, \]

where $z[n]$ denotes the sequence of the first $n$ letters in the string $z$. Usually, $\delta_2$ 
is used as the Baire distance. It is an ultrametric which resembles very much the 
p-adic distance. In any case, it is an easy exercise to prove that $(S(A), \delta_p)$ is a 
complete metric space and that $S_{\mathrm{fin}}(A)$ is a dense subspace of $S(A)$. 
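For finite strings the Baire distance reduces to finding the length of the longest common prefix. The Python sketch below pads the shorter string with blanks, as in the convention above; the choice of `"_"` as the blank letter and the function names are assumptions made for the illustration.

```python
from fractions import Fraction
from itertools import zip_longest

BLANK = "_"  # hypothetical choice of the distinguished blank letter


def common_prefix_length(x, y):
    """Length of the longest common prefix, or None if x and y agree after blank padding."""
    n = 0
    for a, b in zip_longest(x, y, fillvalue=BLANK):
        if a != b:
            return n
        n += 1
    return None


def baire_distance(x, y, p=2):
    """delta_p(x, y) = p**(-n), where n is the length of the longest common prefix."""
    n = common_prefix_length(x, y)
    return Fraction(0) if n is None else Fraction(1, p ** n)
```

With this convention `baire_distance("ACGT", "ACTT")` equals $1/4$, and a string and its blank-padded copy are at distance 0, so they represent the same point of $S(A)$.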

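Theorem 3.1. There exists a p-adic number field $K$ unramified of degree $f$ over 
$\mathbb{Q}_p$, a full system $\mathfrak{R} \subset \mathcal{O}_K$ of representatives modulo $p\mathcal{O}_K$, and a closed isometric 
embedding $\phi\colon (S(A), \delta_p) \to (\mathcal{O}_K, |\cdot|_K)$ which takes $S_{\mathrm{fin}}(A)$ into $\mathfrak{R}[p] \subset \mathcal{O}_K$. The 
set $\phi(S_{\mathrm{fin}}(A))$ is dense in $\mathrm{im}(\phi)$. 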

Proof. Take $f$ sufficiently large, and identify $A$ with a subset of $\mathfrak{R}$ in such a way 
that the blank maps to $0 \in \mathfrak{R}$. Clearly, the distances coincide after identification, 
and the statements of the theorem follow from this. □ 

Remark 3.2. The isometric map $\phi$ in Theorem 3.1 identifies $S(A)$ with a so-called 
affinoid disc, i.e. a closed disc with "holes". In fact, $\mathrm{im}(\phi)$ is the unit disc $\mathcal{O}_K$ 
minus the preimage of some points of $\kappa$ under the canonical projection 
$\rho\colon \mathcal{O}_K \to \mathcal{O}_K / p\mathcal{O}_K = \kappa$. 

Example 3.3. Consider the strings in the letters $\{A, G, C, T\}$ representing DNA 
sequences in the four nucleotides adenine (A), guanine (G), cytosine (C) and 
thymine (T). In \il a rational 5-adic model for such strings is discussed, and 
combined with a 2-adic distance. The encoding in [31 identifies the nucleotides with 
$\{1, 2, 3, 4\}$. Hence, the code alphabet is $A = \{0, 1, 2, 3, 4\}$, and the finite rational 
5-adic numbers represent all finite lists of nucleotides with arbitrarily long spacings 
between them. 

We show that one could use a model based on a single prime, namely $p = 2$, 
using an extension field $K$ finite over $\mathbb{Q}_2$. 

As we are using four letters, the 2-adic field $\mathbb{Q}_2$ is too small, because its residue 
field $\mathbb{F}_2$ has only 2 elements. However, a 2-adic field $K$ with residue field $\kappa = \mathbb{F}_{2^2}$ 
would be precisely sufficient, if we do not care about blanks. This can be realised 
thus: take a primitive third root $\zeta$ of unity, and let $K = \mathbb{Q}_2(\zeta)$ be the corresponding 
cyclotomic field extension. By number theory, $K$ is unramified of degree $f = 2$ over 
$\mathbb{Q}_2$, because $2^2 \equiv 1 \bmod 3$, and $f = 2$ is minimal with that property. 

As $K$ is unramified over $\mathbb{Q}_2$, 2 is a prime of $\mathcal{O}_K = \mathbb{Z}_2[\zeta]$. Then $\mathfrak{R}_T = \{0, 1, \zeta, \zeta^2\}$ 
is the system of Teichmüller representatives for $\mathcal{O}_K / 2\mathcal{O}_K = \mathbb{F}_{2^2}$, and we have 



Now, any bijection $\{A, G, C, T\} \cong \mathfrak{R}_T$ yields a 2-adic encoding of DNA. However, 
this method does not distinguish between 0 and blank, so it is never clear how long a 
string represented by a 2-adic number is supposed to be. On the other hand, there is 
already an existing proposal in [7] based on the single prime 2. There, the bijection 

\[ \{A, G, T, C\} \to \mathbb{F}_2^2, \qquad A \mapsto (0,0),\ G \mapsto (0,1),\ T \mapsto (1,0),\ C \mapsto (1,1) \]

is proposed. If we take the isomorphism $\mathbb{F}_2^2 \cong \mathbb{F}_{2^2}$ defined by $(1, 0) \mapsto 1$, $(0, 1) \mapsto \bar\zeta$, 
where $\bar\zeta \in \mathbb{F}_{2^2}$ is the residue class of $\zeta$, this amounts to encoding DNA by 
$\mathfrak{R} = \{0, 1, \zeta, 1 + \zeta\}$, and we obtain the bijections of ordered sets 

\[ (A, G, T, C) \to (0, \zeta, 1, 1 + \zeta) \to (0, 1, \zeta, \zeta^2). \]

The authors of [7] consider only words of fixed length 3, wherefore the question 
"0 = blank?" does not arise there. 

In any case, if we take $f = 3$, then $\mathbb{Q}_2(\zeta)$ is certainly large enough to include 
"blank" in our 2-adic alphabet for DNA. 

Next, we observe that cyclotomic encoding is persistent: 

Theorem 3.4. Every finite alphabet has a cyclotomic encoding for every prime $p$. 

Proof. We need to show that for all $f$ there is a natural number $n$ such that $f$ 
is minimal with $p^f \equiv 1 \bmod n$. Taking $n = p^f - 1$ sufficiently large proves the 
assertion. □ 

Remark 3.5. Note that arbitrary sets of strings form dendrograms for the Baire 
distance which in general are not normal. In the finite case, it is possible to make 
the dendrogram normal by a shift (which corresponds to multiplication by some 
p-adic integer). 

Remark 3.6. The authors of [10] consider a variant of the Baire distance on 
strings. Namely, for $k \ge 1$ let 

\[ d_k(x, y) = \inf\{p^{-n} \mid x[n] = y[n],\ 0 \le n \le k\}, \]

which they consider for $p = 2$. This distance does not distinguish between strings 
with a common prefix of length $k$ or more. Equivalently, the corresponding p-adic 
numbers are not distinguished, if their p-adic distance equals $d_k(x, y)$ or less. 
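The DNA encoding of Example 3.3 and the truncated distance of Remark 3.6 can be tried out with a small Python sketch. It stores an element $a + b\bar\zeta$ of the residue field $\mathbb{F}_{2^2}$ as the pair $(a, b)$, following the identification $(1,0) \mapsto 1$, $(0,1) \mapsto \bar\zeta$, and uses the relation $\bar\zeta^2 = \bar\zeta + 1$ for multiplication. The names `NUCLEOTIDE_TO_F4`, `f4_mul`, `encode_dna` and `truncated_distance` are illustrative and not taken from the cited works.

```python
from fractions import Fraction

# Bijection {A, G, T, C} -> F_2^2 as in Example 3.3; a pair (a, b) stands for a + b*zeta.
NUCLEOTIDE_TO_F4 = {"A": (0, 0), "G": (0, 1), "T": (1, 0), "C": (1, 1)}


def f4_mul(x, y):
    """Product in F_4 = F_2[zeta] / (zeta**2 + zeta + 1)."""
    (a, b), (c, d) = x, y
    return ((a * c + b * d) % 2, (a * d + b * c + b * d) % 2)


def encode_dna(word):
    """Digit sequence (coefficients of 2**0, 2**1, ...) of a DNA word."""
    return [NUCLEOTIDE_TO_F4[letter] for letter in word]


def truncated_distance(x, y, k, p=2):
    """d_k(x, y): the Baire distance, but common prefixes are only compared up to length k."""
    n = 0
    while n < k and n < min(len(x), len(y)) and x[n] == y[n]:
        n += 1
    return Fraction(1, p ** n)
```

Multiplying out the powers of $\bar\zeta = (0, 1)$ gives $(1, 1)$ and then $(1, 0)$, which confirms that $\bar\zeta$ generates the multiplicative group of the residue field, and `truncated_distance` returns $2^{-k}$ for any two words sharing a prefix of length at least $k$.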



MUMFORD DENDROGRAMS 






3.2. A hierarchic algorithm for strings. The main advantage of strings is that, 
by Theorem 3.4, the extension field $K$ can be a priori chosen as a cyclotomic field 
of fixed degree, as it is determined by the size of the alphabet. 

3.2.1. General description. A solely p-adic agglomerative hierarchic algorithm for 
strings is now presented which does without the changing of the distance function 
usual in the archimedean case. The reason is that by the ultrametric triangle 
inequality, the distance between two disjoint discs $B$ and $B'$ equals the distance 
between any two representatives $x \in B$ and $x' \in B'$. The algorithm can be essentially 
broken down into two steps. 

Step 1. Encode strings by p-adic numbers from the cyclotomic field $K$. 

Step 2. Classify p-adic numbers using $|\cdot|_K$. 

Step 1 has been described in the previous subsection, and Step 2 will be explained 
below. The output is the uniquely determined *-tree for the given strings. 

Remark 3.7. The algorithm in Step 2 is independent of whether the p-adic numbers 
encode strings or not. In fact, it merely classifies p-adic numbers. Hence, the focus 
in p-adic hierarchical classification of data which are not to be taken as strings lies 
in the analogue of Step 1 which is unsolved as far as the author is aware. 

3.2.2. The classification algorithm. Let $D = \{x_1, \dots, x_n\}$ be a set of n different 
p-adic numbers. We assume that these are taken from some cyclotomic p-adic field 
$K = \mathbb{Q}_p(\zeta)$. In the special case $K = \mathbb{Q}_p$, encoding by natural numbers might be 
tempting. Then the euclidean algorithm will yield the p-adic expansion. So, we 
assume that all $x \in D$ are given by their p-adic expansion with coefficients in a full 
system $\mathfrak{R}$ of representatives modulo $pO_K$. 

First note that the computation of $|x|_K$ is simple if x is given by its p-adic 
expansion: it is determined by the lowest power of p occurring with a nonzero coefficient. 

1. Take all $x \in D \setminus \{x_1\}$ with $|x_1 - x|_K$ minimal to form the cluster $C(x_1)$. Do 
the same with all other $x_i \in D$ and obtain the clusters $C(x_1), \dots, C(x_n)$ together 
with all possible inclusions among these clusters. 

2. Let D' be the set of all maximal clusters among the $C(x)$ from 1. Proceed with 
D' in the same way as with D in 1, using the p-adic distance between clusters. By 
ultrametricity, it is given by $|x - y|_K$ for any points x, y representing the clusters. 
Obtain $m < n$ clusters and their hierarchy. 

3. Let D'' be the set of all maximal clusters among the ones obtained in 2. Etc. 
As the number of clusters obtained strictly decreases in each step, the algorithm 
terminates with one single cluster. This must be D, as otherwise one would go on 
clustering. Putting together all hierarchies yields the tree $T^*(D)$, at least topologically. 
Taking some extra care in each step yields the metric or level structure on 
$T^*(D)$, as can easily be seen. 
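
The three steps can be sketched as follows for the special case $K = \mathbb{Q}_p$ with data given as integers. This is an illustration under stated assumptions rather than a definitive implementation: the names `valuation`, `dist` and `classify` are hypothetical, each cluster is represented by its smallest element, and the output is the list of all clusters found, whose inclusions give the tree topologically (the level structure is not computed).

```python
from fractions import Fraction


def valuation(n: int, p: int) -> int:
    """p-adic valuation of a nonzero integer n."""
    n, v = abs(n), 0
    while n % p == 0:
        n //= p
        v += 1
    return v


def dist(x: int, y: int, p: int) -> Fraction:
    """p-adic distance |x - y|_p between two integers."""
    return Fraction(0) if x == y else Fraction(1, p ** valuation(x - y, p))


def classify(data, p):
    """Return all clusters found by steps 1 to 3, smallest first.

    Each cluster is a frozenset of data points. By ultrametricity any two
    clusters are nested or disjoint, so inclusion yields the hierarchy.
    """
    # representative -> set of data points it currently stands for
    members = {x: frozenset([x]) for x in set(data)}
    found = set()
    while len(members) > 1:
        reps = list(members)
        balls = []
        for x in reps:
            # radius = distance from x to its nearest neighbour
            r = min(dist(x, y, p) for y in reps if y != x)
            near = [y for y in reps if dist(x, y, p) <= r]
            balls.append(frozenset().union(*(members[y] for y in near)))
        found.update(balls)
        # keep only the maximal clusters, one representative each
        maximal = [b for b in set(balls) if not any(b < c for c in balls)]
        members = {min(b): b for b in maximal}
    return sorted(found, key=lambda c: (len(c), sorted(c)))
```

For the 2-adic data {0, 1, 2, 4} a single pass already produces the nested clusters {0, 4}, {0, 2, 4} and {0, 1, 2, 4}; for {0, 1, 2, 3} a second pass is needed to join {0, 2} and {1, 3}.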

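Remark 3.8. Clearly, for a given set S of strings, the output $T^*(D)$ depends on 
the encoding $\chi\colon S \to D \subseteq \mathfrak{R}[p]$ of the strings. So, if $\mathfrak{R}$ is replaced by a different set 
$\mathfrak{R}'$ of representatives modulo $pO_K$, then we obtain another encoding $\chi'\colon S \to \mathfrak{R}'[p]$ 
by composing with any bijection $\phi\colon \mathfrak{R} \to \mathfrak{R}'$, i.e. $\chi' = \phi \circ \chi$. Assume that the change 
of coefficients $\phi$ is such that it takes any element $a \in \mathfrak{R}$ representing $\bar{a} \in \kappa$ to 
$\phi(a) \in \mathfrak{R}'$ representing the same $\bar{a}$ in the residue field $\kappa$, i.e. there is a commutative diagram 

$$
\begin{array}{ccc}
\mathfrak{R} & \xrightarrow{\;\phi\;} & \mathfrak{R}' \\
{\scriptstyle \rho_{\mathfrak{R}}} \searrow & & \swarrow {\scriptstyle \rho_{\mathfrak{R}'}} \\
 & \kappa &
\end{array}
$$

where $\rho_{\mathfrak{R}}$ and $\rho_{\mathfrak{R}'}$ are the restrictions of the canonical projection $\rho\colon O_K \to \kappa$ to $\mathfrak{R}$ 
and $\mathfrak{R}'$, respectively. Then the corresponding *-trees are isometric, hence yield the 
same dendrograms. 

Figure 4. The †-tree of the dendrogram in Figure 2. 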

4. Discrete symmetries of time-series 

In a time series of dendrograms $X_t$ with fixed number of ends, we consider the 
underlying data $D_t$ as a set of particles which "move" with respect to one another in 
"time" t, e.g. by using the same set of labels for all data $D_t$. Naturally, a series 
of lists of strings can be considered as such a time series. Fixing the data size 
means that we assume there to be no collisions among particles (cf. [1] for colliding 
particles). Another technical simplification we make is that we consider only binary 
dendrograms. The general case will be treated elsewhere. 

4.1. The †-tree. The tree T underlying a dendrogram X has an important subtree $T^\dagger$ 
which depends on the data: namely, the subtree spanned by the vertices of T. Its 
p-adic counterpart is the subtree $T^\dagger(S)$ of $T^*(S)$ spanned by the vertices $v(x, y, z)$, 
where $x, y, z \in S \subseteq \mathbb{P}^1(K)$ are three distinct points. We will also speak of a †-tree 
when meaning $T^\dagger$. For example, the †-tree of the dendrogram in Figure 1 is a 
segment •---•, and the †-tree of the tree T underlying the dendrogram in Figure 2 is 
shown in Figure 4, where the numbers indicate the edge lengths. 

The †-tree gives a rough idea of the distribution of the distances within data. 
To this end, we define the volume of $T^\dagger$ (or of a dendrogram $X = (T, v, \lambda, D, \mu)$) as 
the total length of its edges: 

$$\mathrm{Vol}(T^\dagger) = \sum_{e \in T^\dagger_1} \mu(e).$$

Each child e of v gives rise to a branch $\Gamma$ consisting of all geodesic paths beginning 
in the target vertex of e and directed away from v. The weight of the branch $\Gamma$ is now 
defined as 

$$w(\Gamma) = \mathrm{Vol}(\Gamma) + \mu(e) = \mathrm{Vol}(\Gamma^\dagger) + \mu(e).$$

The influence of the branches on the dendrogram X is measured by a complex number we call the 
balance of X: 

$$b(X) = \sum_{\nu=0}^{m-1} w_\nu\, e^{2\pi i \nu / m} \in \mathbb{C},$$

where $\Gamma_0, \dots, \Gamma_{m-1}$ are the different branches and $w_\nu = w(\Gamma_\nu)$. X is balanced if 
$b(X) = 0$. This occurs if and only if all weights $w_\nu$ are equal. 
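
A small numerical sketch of the balance may help. It assumes that the balance is the sum of the weights multiplied by the m-th roots of unity $e^{2\pi i \nu/m}$, a reading which agrees with the binary formula $b = w_0 - w_1$ used later; the function name `balance` is illustrative.

```python
import cmath


def balance(weights):
    """Balance of a dendrogram with branch weights w_0, ..., w_{m-1}.

    Assumption: b = sum over nu of w_nu * exp(2*pi*i*nu/m).
    """
    m = len(weights)
    return sum(w * cmath.exp(2j * cmath.pi * nu / m)
               for nu, w in enumerate(weights))
```

With the weights 8 and 1 this gives a balance of 7 up to rounding, and three equal weights give a balance of 0 up to rounding, so that dendrogram is balanced.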

Example 4.1. The dendrogram X in Figure \^ has the following values for the 
quantities: 

$$\mathrm{Vol}(X) = 9, \quad w(\Gamma_0) = 8, \quad w(\Gamma_1) = 1, \quad b(X) = 7.$$
Studying the balance $b(X_t)$ of a time series $X_t$ gives a first indication of the 
behaviour of the growth of individual branches. 

Remark 4.2. The †-tree of a p-adic dendrogram indicates the amount of freedom 
one has for the coding map $\chi$. Namely, data D can be given any p-adic values $\mathfrak{X}$ 
or $\mathfrak{X}'$, as long as 

$$T^\dagger(\mathfrak{X} \cup \{\infty\}) = T^\dagger(\mathfrak{X}' \cup \{\infty\})$$

holds true. This means for the p-adic expansions that the coefficients of the high 
powers of p can be chosen arbitrarily from $\mathfrak{R}$. 

4.2. Time-invariant subtrees of p-adic dendrograms. Any edge e in the underlying 
rooted tree $(T, v)$ of a dendrogram X defines a branch $\Gamma_e$. It is itself a 
dendrogram with data $D_e$ and is the union of e and the subtree of T spanned by all 
vertices of T below e. In order to be able to compare the evolution in time of dendrograms, 
we assume that we are given a family $F\colon \{X(0), \dots, X(N)\} \to \mathfrak{M}_{0,n+1}$ of 
normal p-adic dendrograms with n data. In this case, v is a fixed vertex of the time 
series F. This time series F defines a family of subtrees $T^*(i) = T^*(\chi(D_i) \cup \{\infty\})$ 
of the Bruhat-Tits tree $\mathcal{T}_K$ for K a sufficiently large p-adic number field, where 
$\chi$ is the coding map associated to F, and $D_i$ the data of $X(i)$. A subtree $\Gamma$ of 
some $T^*(i)$ is said to be time-invariant if $\Gamma$ lies in all $T^*(i)$. A geodesic $\gamma = \,]a, b[$ is 
time-invariant if $\gamma$ lies in all $T^*(i)$ and at all times $(a, b)$ represents the same pair 
of particles. A branch $\Gamma$ of some $T^*(i)$ is time-invariant if the path from v to the 
root $v_\Gamma$ of $\Gamma$ is a time-invariant subtree, and the data adherent to $v_\Gamma$ by paths away 
from v represent the same set of particles at all times $t = 0, \dots, N$. 

4.3. Time series of genus one. The definition of balance uses the vertex v as point 
of reference. In our considerations, it will play the role of a "fixed star". 

Assume for convenience that we are given a binary p-adic time series, that is, a 
time series of binary dendrograms $X_0, \dots, X_N$ encoded in some p-adic number field 
K unramified over $\mathbb{Q}_p$. In the binary case, the balance of each $X_t$ is given as 

$$b(X_t) = w_0 - w_1 \in \mathbb{Z}, \qquad t = 0, \dots, N.$$

The intersection $T_t \cap \,]0, 1[$ is a segment $I_t = [v_0(t), v_1(t)]$ and contains v. Assume 
that $b(I_t)$ follows a linear trend with rational slope $c = \frac{d}{e}$. If $c \neq 0$, we may assume 
that $d \in \mathbb{Z}$ and $e \in \mathbb{N}$ have no common divisor. We call c the velocity of the time 
series $X_t$ along the geodesic path $]1, 0[$. 

Consider w.l.o.g. the case c < 0. This means that there is a net flow of balance 
towards 1. 

Case $\forall t$: $v \in \,]v_0(t), v_1(t)[$, the interior of $I_t$. This case will be called flow from infinity and can 
be interpreted as there being a balance flow from outside the data. This case will 
not be considered, although technically it should be similar to the following case. 

Case $\exists t_0\ \forall t \ge t_0$: $v = v_0(t)$. Then $X_t$ follows a translation $\tau$ along $]0, 1[$ with 
velocity c, where 0 is the repellent fixed point of $\tau$, and 1 the attracting fixed point. 
If $e > 1$, then $\tau$ does not act on the tree $\mathcal{T}_K$, because translations on $\mathcal{T}_K$ can only 
shift by whole numbers of edges of $\mathcal{T}_K$. However, if L is a p-adic number field which is 
purely ramified over K with ramification index e, then there is a Bruhat-Tits tree 
$\mathcal{T}_L$ which is of the same regularity as $\mathcal{T}_K$, and which topologically contains $\mathcal{T}_K$, 
but in which every edge of $\mathcal{T}_K$ is subdivided into e edges of equal length $\frac{1}{e}$. 

Note that an extension of p-adic number fields L over K is ramified if there is 
a prime $\pi_L$ of $O_L$ such that $|\pi_L|_L = |\pi_K|_K^{1/e}$ holds true for some $e > 1$, where $\pi_K$ is 
a prime of $O_K$. The number e is the ramification index. The extension is purely 
ramified if the corresponding extension of residue fields $\kappa_L$ over $\kappa_K$ has degree one. 
If K is unramified over $\mathbb{Q}_p$, then it follows in any case that $|\pi_L|_L = p^{-1/e}$. 
[Figure 5 panels, labelled by their balances: b(0) = 2, b(1) = 1, b(2) = -1, b(3) = -2] 
Figure 5. A time series of dendrograms. 



Assume now that $c = \frac{d}{e} \neq 0$ with $e > 1$, and d prime to e. Let L be a p-adic 
number field which is purely ramified over K with ramification index e. Let $c_p \in L$ 
be such that $|c_p|_L = p^{-c} \neq 1$ (e.g. $c_p = p^{c}$, if this is an element of L). Then the 
translation $\tau$ can be represented by the hyperbolic transformation 

$$\theta\colon z \mapsto \frac{-c_p z}{(1 - c_p) z - 1}$$

in the projective linear group $\mathrm{PGL}_2(L)$. This transformation $\theta$ can in turn be 
represented by the matrix 

$$\begin{pmatrix} -c_p & 0 \\ 1 - c_p & -1 \end{pmatrix}.$$

Because $\theta$ is hyperbolic, it generates a discrete subgroup $H = \langle\theta\rangle$ of $\mathrm{PGL}_2(L)$ which 
acts on the p-adic manifold $\mathbb{G}_m = \mathbb{P}^1 \setminus \{0, 1\}$. The quotient $E = \mathbb{G}_m / H$ is defined 
over L and known as a Tate curve, i.e. a compact p-adic Riemann surface of genus 
1, the p-adic analogue of the surface of a torus. Its L-rational points are given 
as 

$$E(L) = L^\times / H \cong \mathrm{Ends}(\mathcal{T}_L / H),$$

where $\mathrm{Ends}(G)$ denotes the set of ends of a graph G. The quotient graph $\mathcal{E} = \,]0, 1[\,/H$ 
is a loop, i.e. has first Betti number $\beta_1 = h_1(|\mathcal{E}|, \mathbb{R}) = 1$. The time series $X_t$ 
induces a dynamical system of vertex pairs on $\mathcal{E}$. The Tate curve E is also endowed 
with a dynamical system of L-rational points via its p-adic encoding. In the latter, 
the points of the dynamical system on E are the L-rational points given by the H-orbits 
of the ends of $\mathcal{T}_L$ encoding the data of $X_t$. Hence, for fixed p, the time series 
$X_t$ has the invariants $c = \frac{d}{e}$, L, K (cyclotomic), and $\beta_1 = 1$. The corresponding 
p-adic time series obtained by encoding has the further invariant (assuming that 
$p^c \in L$) 

$$\Theta = \begin{pmatrix} -p^c & 0 \\ 1 - p^c & -1 \end{pmatrix},$$

which gives rise to the Tate curve $E = \mathbb{G}_m / \langle\theta\rangle$, where $\theta \in \mathrm{PGL}_2(L)$ is the Möbius 
transformation associated to $\Theta$. 
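
The action of such a transformation can be checked with exact rational arithmetic. The sketch below assumes that the matrix has rows $(-q, 0)$ and $(1 - q, -1)$ with $q$ standing in for $p^c$, and it uses an integer exponent so that everything stays inside $\mathbb{Q}$; the genuinely fractional case needs the ramified field L and is not modelled here. The names `theta`, `coordinate` and `valuation` are illustrative. In the coordinate $w = z/(z-1)$, which sends the fixed points 0 and 1 to 0 and $\infty$, the map becomes multiplication by $q$, so the valuation of $w$ shifts by a fixed amount at each step.

```python
from fractions import Fraction


def valuation(x: Fraction, p: int) -> int:
    """p-adic valuation of a nonzero rational number."""
    if x == 0:
        raise ValueError("the valuation of 0 is infinite")
    n, d, v = abs(x.numerator), x.denominator, 0
    while n % p == 0:
        n //= p
        v += 1
    while d % p == 0:
        d //= p
        v -= 1
    return v


def theta(z: Fraction, q: Fraction) -> Fraction:
    """Moebius map of the matrix ((-q, 0), (1 - q, -1)); fixes 0 and 1."""
    return (-q * z) / ((1 - q) * z - 1)


def coordinate(z: Fraction) -> Fraction:
    """Coordinate w = z / (z - 1), sending 0 to 0 and 1 to infinity."""
    return z / (z - 1)


p, q = 3, Fraction(9)      # q = p^2, a hypothetical integral stand-in for p^c
z = Fraction(5, 7)         # arbitrary test point
shift = valuation(coordinate(theta(z, q)), p) - valuation(coordinate(z), p)
```

Here `shift` equals the valuation of `q`, which is the sense in which the transformation translates along the geodesic between its two fixed points.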

Example 4.3. Consider the series of dendrograms as symbolically depicted in Figure 5. 
The time series of balances follows the recursion 

$$b(t+1) = b(t) + c_t, \qquad c_t = \begin{cases} -1, & t \text{ even}, \\ -2, & t \text{ odd}. \end{cases}$$

On average, the balance increases each time by $c = -\frac{3}{2}$. Hence, we have a 
translation on the real line by c with quotient graph $\mathcal{E}$ a circle, and two vertices on 
$\mathcal{E}$ representing the two orbits of the marked vertex • in each dendrogram of Figure 5. 
The loop obtained is depicted in Figure 6, where $v_{\mathrm{even}}$ represents the dendrograms 
at even times t, and $v_{\mathrm{odd}}$ those at odd t. 
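
The recursion can be replayed numerically. The sketch assumes that the balances read off Figure 5 are b(0) = 2, b(1) = 1, b(2) = -1, b(3) = -2 and that the increments alternate with period two; both function names are illustrative.

```python
from fractions import Fraction
from itertools import cycle, islice


def balances(b0, increments, steps):
    """Iterate b(t+1) = b(t) + c_t with periodically repeating increments."""
    out = [b0]
    for c_t in islice(cycle(increments), steps):
        out.append(out[-1] + c_t)
    return out


def velocity(increments):
    """Average increment over one period, as a fraction in lowest terms."""
    return Fraction(sum(increments), len(increments))


series = balances(2, [-1, -2], 3)   # balances at t = 0, 1, 2, 3
c = velocity([-1, -2])              # average change of balance per time step
```

Under these assumptions the denominator of `c` is 2, which matches the two vertices on the loop.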
Figure 7. Segment with two loops. 



4.4. Mumford curves. Here, we assume w.l.o.g. that K is a sufficiently large p-adic 
number field. By this we mean that the power $p^u$ by any fraction u which 
needs to be taken lies in K. This implies that the u-th fraction of a path within 
$\mathcal{T}_K$ is also defined over K, i.e. is a sequence of edges from $\mathcal{T}_K$. 

Assume that a time series of binary p-adic dendrograms $X_t$ gives rise to a dynamical 
system on a Tate curve through an action on the geodesic $]0, 1[$ as described 
in Section 4.3. By translations, we can transform the dendrograms $X_t$ into 
dendrograms $X'_t$ such that the segments $I'_t = T'_t \cap \,]0, 1[$ are all balanced, where $T'_t$ is the tree 
underlying $X'_t$. If we now assume that $I'_t = [v'_0(t), v'_1(t)]$ is approximately constant 
in time, then by a small deformation of the family $X'_t$ we may assume that 
$v'_0(t) = v'_0 = \mathrm{const}$. Let $e_0(t)$ be the edge originating in $v'_0$ and not lying in $I'_t$. 
If further $\mu(e_0(t))$ is approximately constant, then by another small deformation, 
we can assume that the time series $X'_t$ has a fixed vertex $w_0$ which is the target of 
$e_0(t)$. This vertex is the root of the time-invariant branch $\Gamma'_0(t)$ of $X'_t$. 

Having made this cascade of assumptions, there is one further assumption which 
takes us into the situation preceding the introduction of time series of genus one, 
namely, that $w_0$ lies on a time-invariant geodesic line $]a, b[$ for some a, b in the data. 
In fact, the initial terms of the two p-adic numbers a and b are uniquely determined 
by the path $v \to w_0$. Continuing the p-adic expansion with zero coefficients yields 
a, and continuing with 1 and then zeros yields $b = a + p^m$ for some m larger than the 
highest power of p occurring in a. In the case that the conditions for constructing 
the Tate curve are fulfilled for the branches $\Gamma'_0(t)$, we end up with a p-adic Riemann 
surface of genus 2 because of the translation $\sigma$ by a fraction u along the geodesic 
$]a, b[ \,\subseteq \mathcal{T}_K$. In fact, $\sigma$ is represented by a hyperbolic transformation $\varsigma \in \mathrm{PGL}_2(K)$ 
with matrix 

$$\begin{pmatrix} a - p^u b & (p^u - 1)\,ab \\ 1 - p^u & p^u a - b \end{pmatrix}.$$

Together with $\theta$, we obtain a discrete subgroup $F_2 = \langle\theta, \varsigma\rangle$ of $\mathrm{PGL}_2(K)$ generated 
by $\theta$ and $\varsigma$. The closure in $\mathbb{P}^1$ of the union of the $F_2$-orbits of 0, 1, a, b is a set $\mathcal{L}$ whose 
complement $\Omega = \mathbb{P}^1 \setminus \mathcal{L}$ is a p-adic manifold on which $F_2$ acts, and the quotient 
$C = \Omega / F_2$ is a p-adic Riemann surface of genus 2. A p-adic Riemann surface of 
genus 2 or higher is usually called a Mumford curve. The Mumford curve C comes 
again with a dynamical system on its K-rational points $C(K) = \mathrm{Ends}(\mathcal{T}_K / F_2)$ 
given by the orbits of the data. The smallest subtree $T^*(\mathcal{L})$ of $\mathcal{T}_K$ such that 
$\mathrm{Ends}(T^*(\mathcal{L})) = \mathcal{L}$ is an $F_2$-invariant tree, and the resulting quotient graph $\mathcal{C}$ is a 
finite graph with first Betti number $h_1(|\mathcal{C}|, \mathbb{R}) = 2$, as illustrated in Figure 7. 
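
The matrix of a hyperbolic transformation with prescribed fixed points can be verified by exact arithmetic. The sketch assumes the matrix has rows $(a - kb,\ (k-1)ab)$ and $(1 - k,\ ka - b)$ with multiplier $k$ standing in for $p^u$; it uses rational numbers in place of p-adic ones, which is harmless here because the identities being checked are purely algebraic. All function names are illustrative.

```python
from fractions import Fraction


def hyperbolic_matrix(a, b, k):
    """Matrix of the Moebius map with fixed points a, b and multiplier k."""
    return ((a - k * b, (k - 1) * a * b), (1 - k, k * a - b))


def moebius(matrix, z):
    """Apply the Moebius transformation given by a 2x2 matrix to z."""
    (m11, m12), (m21, m22) = matrix
    return (m11 * z + m12) / (m21 * z + m22)


def coordinate(z, a, b):
    """Coordinate w = (z - a) / (z - b), sending a to 0 and b to infinity."""
    return (z - a) / (z - b)


# Hypothetical sample values: p = 3, b = a + p^2, multiplier k = p^(-1).
a, b, k = Fraction(2), Fraction(11), Fraction(1, 3)
M = hyperbolic_matrix(a, b, k)
z = Fraction(5)
```

In the coordinate $w = (z - a)/(z - b)$ the map is multiplication by $k$, and for $a = 0$, $b = 1$ the matrix specialises to the one with rows $(-k, 0)$ and $(1 - k, -1)$.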
Remark 4.4. The fact that the geodesic lines $]0, 1[$ and $]a, b[$ in the Bruhat-Tits tree 
are disjoint is sufficient for the translations $\tau$, $\sigma$ to generate a discrete hyperbolic 
group $\langle\theta, \varsigma\rangle \subseteq \mathrm{PGL}_2(K)$, and hence give rise to a Mumford curve of genus 2. This, 
however, is not a necessary condition. In fact, if the geodesic lines intersect in a 
segment I, then the length of I must not be larger than the periods of $\tau$ and $\sigma$ in 
order for the group of hyperbolic transformations to be discrete. In the non-discrete 
case, there is no Mumford curve obtained by the action on the projective line. The 
case of time series with time-invariant intersecting geodesics is a bit more involved 
and will be treated elsewhere. 

5. Conclusion 

We have studied dendrograms from the viewpoint of p-adic geometry, where combinatorial 
objects are associated to spaces in a natural way. The space here is the 
p-adic projective line $\mathbb{P}^1$, and punctures of $\mathbb{P}^1$ define a subtree of the p-adic Bruhat-Tits 
tree, i.e. a dendrogram whose data D are the punctures. This dendrogram is 
the hierarchic classification of the p-adic numbers from D with respect to the p-adic 
norm $|\cdot|_p$. Due to the ultrametric property of $|\cdot|_p$, the classification algorithm for 
p-adic numbers is simple. Hence, the focus in data mining shifts from classification 
to p-adic data encoding, a task which in general is far from trivial. However, in the 
case of strings over a finite alphabet A, we have observed that the task becomes 
much simpler, because the letters from A can be identified with coefficients in the 
p-adic expansion of numbers. Finally, we have introduced the genus g of a time series 
of p-adic dendrograms by associating to it a discrete action on the Bruhat-Tits 
tree, exemplified in the cases $g = 1$ and $g = 2$. From this action, a finite quotient 
graph G can be constructed. Even more, the action yields a dynamical system on 
a so-called Mumford curve of genus g whose associated combinatorial object from 
p-adic geometry is G. These new invariants now await practical application in the 
study of time series data. 